#### **Cache Memory Optimizations**

## Cache Measures

- *Hit rate*: fraction found in that level
   So high that usually talk about *Miss rate*
- Miss penalty: time to replace a block from lower level, including time to replace in CPU
  - access time: time to lower level
    - = f(latency to lower level)
  - *transfer time*: time to transfer block
    - =f(BW between upper & lower levels)

#### 4 Questions for Memory Hierarchy

- Q1: Where can a block be placed in the upper level? (*Block placement*)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level?

- Block 12 placed in 8 block cache:
  - Fully associative, direct mapped, 2-way set associative
  - S.A. Mapping = Block Number Modulo Number Sets



#### Q2: How is a block found if it is in the upper level?

- Tag on each block
  - No need to check index or block offset
- Increasing associativity shrinks index, expands tag

| Block Address | Block  |
|---------------|--------|
| Tag           | Offset |

# Q3: Which block should be replaced on a miss?

- Easy for Direct Mapped
- Set Associative or Fully Associative:
  - Random
  - LRU (Least Recently Used)

| Assoc: | 2-way |       | 4-way | /     | 8-wa  | ау    |
|--------|-------|-------|-------|-------|-------|-------|
| Size   | LRU   | Ran   | LRU   | Ran   | LRU   | Ran   |
| 16 KB  | 5.2%  | 5.7%  | 4.7%  | 5.3%  | 4.4%  | 5.0%  |
| 64 KB  | 1.9%  | 2.0%  | 1.5%  | 1.7%  | 1.4%  | 1.5%  |
| 256 KB | 1.15% | 1.17% | 1.13% | 1.13% | 1.12% | 1.12% |

## Q4: What happens on a write?

|                                                  | Write-Through                            | Write-Back                                                      |
|--------------------------------------------------|------------------------------------------|-----------------------------------------------------------------|
| Policy                                           | Data written to<br>cache block           | Write data only<br>to the cache                                 |
|                                                  | also written to<br>lower-level<br>memory | Update lower<br>level when a<br>block falls out of<br>the cache |
| Do read misses produce writes?                   | No                                       | Yes                                                             |
| Do repeated<br>writes make it to<br>lower level? | Yes                                      | No                                                              |

#### **Causes of misses**

- Compulsory
  - First reference to a block
- Capacity
  - Blocks discarded and later retrieved
- Conflict
  - Program makes repeated references to multiple addresses from different blocks that map to the same location in the cache

## Memory Hierarchy Basics

#### Basic cache optimizations:

- Larger block size
  - Reduces compulsory misses
  - Increases capacity and conflict misses, increases miss penalty
- Larger total cache capacity to reduce miss rate
  - Increases hit time, increases power consumption
- Higher associativity
  - Reduces conflict misses
  - Increases hit time, increases power consumption
- Higher number of cache levels
  - Reduces overall memory access time

## Ten Advanced Optimizations

- Small and simple first level caches
  - Critical timing path:
    - addressing tag memory, then
    - comparing tags, then
    - selecting correct set
  - Direct-mapped caches can overlap tag compare and transmission of data
  - Lower associativity reduces power because fewer cache lines are accessed

#### L1 Size and Associativity



Access time vs. size and associativity

#### L1 Size and Associativity



Energy per read vs. size and associativity

## Way Prediction

- To improve hit time, predict the way to pre-set mux
  - Mis-prediction gives longer hit time
  - Prediction accuracy
    - > 90% for two-way
    - > 80% for four-way
- Extend to predict block as well
  - "Way selection"
  - Increases mis-prediction penalty

# **Pipelining Cache**

- Pipeline cache access to improve bandwidth
  - Examples:
    - Pentium: 1 cycle
    - Pentium Pro Pentium III: 2 cycles
    - Pentium 4 Core i7: 4 cycles
- Increases branch mis-prediction penalty
- Makes it easier to increase associativity

## **Nonblocking Caches**

- Allow hits before previous misses complete
  - "Hit under miss"
  - "Hit under multiple miss"
- Very Effective in Out of order Execution

## **Multibanked Caches**

- Organize cache as independent banks to support simultaneous access
  - ARM Cortex-A8 supports 1-4 banks for L2
  - Intel i7 supports 4 banks for L1 and 8 banks for L2
- Interleave banks according to block address



**Figure 2.6** Four-way interleaved cache banks using block addressing. Assuming 64 bytes per blocks, each of these addresses would be multiplied by 64 to get byte addressing.

# Critical Word First, Early Restart

- Critical word first
  - Request missed word from memory first
  - Send it to the processor as soon as it arrives
- Early restart
  - Request words in normal order
  - Send missed work to the processor as soon as it arrives
- Effectiveness of these strategies depends on block size and likelihood of another access to the portion of the block that has not yet been fetched

## Merging Write Buffer

- When storing to a block that is already pending in the write buffer, update write buffer
- Reduces stalls due to full write buffer



## **Compiler Optimizations**

- McFarling [1989] reduced caches misses by 75% on 8KB direct mapped cache, 4B blocks <u>in software</u>
- Instructions
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts (using tools they developed)
- Data
  - Merging Arrays: improve spatial locality by single array of compound elements vs. 2 arrays
  - Loop Interchange: change nesting of loops to access data in order stored in memory
  - Loop Fusion: Combine 2 independent loops that have same looping and some variables overlap
  - Blocking: Improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

## Merging Arrays Example

```
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];
/* After: 1 array of stuctures */
struct merge {
   int val;
   int key;
};
struct merge merged_array[SIZE];
```

Reducing conflicts between val & key; improve spatial locality

#### Loop Interchange Example

Sequential accesses instead of striding through memory every 100 words; improved spatial locality

#### Loop Fusion Example

2 misses per access to a & c vs. one miss per access; improve spatial locality

## Hardware Prefetching

Fetch two blocks on miss (include next sequential block)

## **Compiler Prefetching**

- Insert prefetch instructions before data is needed
- Register prefetch
  - Loads data into register
- Cache prefetch
  - Loads data into cache

#### Summary

| Technique                                        | Hit<br>time | Band-<br>width | Miss<br>penalty | Miss<br>rate | Power<br>consumption | Hardware cost/<br>complexity | Comment                                                                                                              |
|--------------------------------------------------|-------------|----------------|-----------------|--------------|----------------------|------------------------------|----------------------------------------------------------------------------------------------------------------------|
| Small and simple<br>caches                       | +           |                |                 | -            | +                    | 0                            | Trivial; widely used                                                                                                 |
| Way-predicting caches                            | +           |                |                 |              | +                    | 1                            | Used in Pentium 4                                                                                                    |
| Pipelined cache access                           | _           | +              |                 |              |                      | 1                            | Widely used                                                                                                          |
| Nonblocking caches                               |             | +              | +               |              |                      | 3                            | Widely used                                                                                                          |
| Banked caches                                    |             | +              |                 |              | +                    | 1                            | Used in L2 of both i7 and<br>Cortex-A8                                                                               |
| Critical word first<br>and early restart         |             |                | +               |              |                      | 2                            | Widely used                                                                                                          |
| Merging write buffer                             |             |                | +               |              |                      | 1                            | Widely used with write through                                                                                       |
| Compiler techniques to reduce cache misses       |             |                |                 | +            |                      | 0                            | Software is a challenge, but<br>many compilers handle<br>common linear algebra<br>calculations                       |
| Hardware prefetching<br>of instructions and data |             |                | +               | +            | -                    | 2 instr.,<br>3 data          | Most provide prefetch<br>instructions; modern high-<br>end processors also<br>automatically prefetch in<br>hardware. |
| Compiler-controlled prefetching                  |             |                | +               | +            |                      | 3                            | Needs nonblocking cache;<br>possible instruction overhead;<br>in many CPUs                                           |

**Figure 2.11** Summary of 10 advanced cache optimizations showing impact on cache performance, power consumption, and complexity. Although generally a technique helps only one factor, prefetching can reduce misses if done sufficiently early; if not, it can reduce miss penalty. + means that the technique improves the factor, – means it hurts that factor, and blank means it has no impact. The complexity measure is subjective, with 0 being the easiest and 3 being a challenge.